feat(benchmarks): competitive comparison docs and JSON output (closes #7) #18
Conversation
Run `npm audit fix` to patch all non-breaking security issues:

- next: 15.5.11 → 15.5.14 (HTTP request smuggling, image cache growth)
- axios: DoS via `__proto__` in mergeConfig
- socket.io-parser: unbounded binary attachments
- serialize-javascript: RCE via RegExp/Date
- storybook: dev server WebSocket hijacking
- lodash: prototype pollution + code injection
- brace-expansion, minimatch, picomatch, ajv: ReDoS vulnerabilities

10 low-severity vulnerabilities remain in deep transitive chains (elliptic via @storybook/nextjs, @tootallnate/once via jest-environment-jsdom) that require breaking changes to fix and affect dev tooling only.

Closes #2
… issue #7

- Update METAL_GPU_RESULTS.md with a competitive comparison table (vs llama.cpp, Ollama, LM Studio), projected throughput across all Apple Silicon variants based on memory bandwidth scaling, and a community contribution guide for other hardware owners
- Add `--output-json` flag to the bench CLI for machine-readable result export
- Expand the ignored `test_inference_performance` to report the full metric set (hardware info, load time, per-run breakdown, latency stats)
- Add `scripts/benchmark_metal.sh` for standardized one-command benchmarking

Closes #7
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35872e6a44
```shell
echo "Tokens: $TOKENS per iteration"
echo ""

cargo run --release -- bench \
```
Add gguf feature flag when invoking benchmark command
The new benchmark script calls `cargo run --release -- bench` without enabling any model backend feature, so in a default checkout it builds with default features only (which exclude gguf), and `bench --model *.gguf` cannot resolve a backend at runtime. This makes the advertised one-command Metal benchmark fail unless users manually add features; invoke Cargo with `--features gguf` (or an equivalent feature set) here so the script works out of the box.
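The suggested fix could look like this sketch of the script's invocation (the `gguf` feature name comes from the review above; `BENCH_CMD` is a hypothetical variable name, not the script's actual code):

```shell
# Hypothetical sketch: build the benchmark invocation with the gguf
# backend feature enabled so `bench --model *.gguf` can resolve a backend.
BENCH_CMD="cargo run --release --features gguf -- bench"

# The script would then append model and output arguments, e.g.:
#   $BENCH_CMD --model model.gguf --output-json result.json
echo "$BENCH_CMD"
```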
```rust
let mut sys = System::new_all();
sys.refresh_all();
let hostname = sys.host_name().unwrap_or_else(|| "unknown".to_string());
let total_memory_gb = sys.total_memory() as f64 / 1_073_741_824.0;
```
Convert sysinfo memory units before reporting total RAM
sysinfo in this repo is pinned to 0.29.x, where total_memory() is reported in KiB; dividing by 1_073_741_824.0 treats the value as bytes and underreports RAM by ~1024x (for example, typical 32–64 GB machines print near 0.0 GB). Because this test output is intended for benchmark submissions, the hardware metadata becomes misleading unless the conversion uses KiB→GiB scaling.
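A minimal sketch of the suggested KiB→GiB conversion, assuming sysinfo 0.29.x's `total_memory()` reports KiB as described above (the helper name is illustrative, not the repo's actual code):

```rust
/// Convert a KiB count (sysinfo 0.29.x `total_memory()` units) to GiB.
/// 1 GiB = 1024 * 1024 KiB = 1_048_576 KiB.
fn kib_to_gib(kib: u64) -> f64 {
    kib as f64 / 1_048_576.0
}

fn main() {
    // A 32 GiB machine reports 33_554_432 KiB.
    let total_memory_gb = kib_to_gib(33_554_432);
    println!("{total_memory_gb} GiB"); // prints "32 GiB"
}
```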
- Fix inverted token heuristic in performance test: was dividing by 0.75 (inflating count ~33%); now uses chars/4 to match the bench CLI exactly
- Fix median calculation to handle partial inference failures: use the proper odd/even formula instead of len()/2, which was wrong for even-length arrays
- Add hardware fields to BenchmarkJsonResult (hostname, total_memory_gb, os_version) so JSON exports are self-contained for issue #7 contributors
- Validate that the `--output-json` parent directory exists before running the benchmark, to avoid a cryptic OS error after a full run completes
- Fix sysinfo memory divisor to use 1_073_741_824.0 per CLAUDE.md
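The token heuristic and median fixes described above could be sketched as follows (function names are illustrative, not the repo's actual code):

```rust
/// chars/4 token estimate, matching the bench CLI heuristic from the
/// commit message (the old code divided by 0.75, inflating the count).
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Median of per-run throughputs; handles even-length arrays by
/// averaging the two middle elements instead of taking `v[len / 2]`.
fn median(samples: &mut Vec<f64>) -> Option<f64> {
    if samples.is_empty() {
        return None; // all runs may have failed
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = samples.len() / 2;
    if samples.len() % 2 == 1 {
        Some(samples[mid])
    } else {
        Some((samples[mid - 1] + samples[mid]) / 2.0)
    }
}
```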
Summary
- Update `METAL_GPU_RESULTS.md` with a competitive comparison table (Inferno vs llama.cpp, Ollama, LM Studio on M4 Max), projected throughput across all Apple Silicon variants based on memory bandwidth scaling, and a community contribution guide
- Add `--output-json <file>` flag to `cargo run -- bench` for machine-readable result export
- Expand the `test_inference_performance` test to report the full metric set required by Test Metal GPU Performance Across Apple Silicon Variants #7: hardware info, load time, per-run throughput, latency stats
- Add `scripts/benchmark_metal.sh` — one-command standardized benchmark that writes a JSON results file contributors can share in Test Metal GPU Performance Across Apple Silicon Variants #7

Test plan
- `cargo test --lib` — 900 tests pass
- `cargo test --test metal_performance_tests --features gguf` — 4 pass, 1 ignored (requires a real model)
- `cargo run --release -- bench --model <gguf> --output-json /tmp/result.json` to verify JSON output